Language Trees and Zipping

In this letter we present a very general method to extract information from a generic string of
characters, e.g. a text, a DNA sequence or a time series. Based on data-compression techniques, its
key point is the computation of a suitable measure of the remoteness of two bodies of knowledge.
We present the application of the method to linguistically motivated problems, featuring highly
accurate results for language recognition, authorship attribution and language classification. (PACS:
89.70.+c,05.)

Many systems and phenomena in nature are represented
in terms of sequences or strings of characters.
In experimental investigations of physical processes, for
instance, one typically has access to the system only
through a measuring device which produces a time record
of a certain observable, i.e. a sequence of data. On the
other hand, other systems are intrinsically described by
strings of characters, e.g. DNA and protein sequences,
or language.

When analyzing a string of characters the main goal
is to extract the information it carries. For a DNA
sequence this would correspond to the identification of the
sub-sequences coding for the genes and their specific
functions. On the other hand, for a written text one is
interested in understanding it, i.e. in recognizing the
language in which the text is written, its author, the
subject treated and possibly the historical background.

With the problem cast in this way, one is tempted
to approach it from a very interesting point of view: that
of information theory. In this context the word
information acquires a very precise meaning, namely that
of the entropy of the string, a measure of the surprise that
the source emitting the sequences can hold for us.

Evidently, the word information is used with different
meanings in different contexts. Suppose for a moment
that we are able to measure the entropy of a given
sequence (e.g. a text). Is it possible to obtain from this
measure the information (in the semantic sense) we were
trying to extract from the sequence? This is the question
we address in this paper.

In particular we define in a very general way a concept
of remoteness (or similarity) between pairs of sequences
based on their relative information content. We devise,
without loss of generality with respect to the nature of
the strings of characters, a method to measure this
distance based on data-compression techniques. The specific
question we address is whether this information-based distance
between pairs of sequences is representative of the real
semantic difference between the sequences. It turns out
that the answer is yes, at least in the framework of the
examples on which we have implemented the method.



We have chosen some textual corpora for our tests and
we have evaluated our method on the basis of the results
obtained on some linguistically motivated problems. Is
it possible to automatically recognize the language in
which a given text is written? Is it possible to
automatically guess the author of a given text? Last but not
least, is it possible to identify the subject treated in
a text in a way that permits its automatic classification
among many other texts in a given corpus? In all these
cases the answer is positive, as we shall show in
the following.

Before entering into the details of our method let us
briefly recall the definition of entropy, which is closely
related to a very old problem, that of transmitting a
message without losing information, i.e. the problem of
efficient encoding.

The problem of the optimal coding for a text (or an
image or any other kind of information) was studied
extensively in the last century. In particular Shannon [1]
discovered that there is a limit to the possibility
of encoding a given sequence. This limit is the entropy of
the sequence. There are many equivalent definitions of
entropy, but probably the best definition in this context is
the Chaitin-Kolmogorov entropy: the entropy
of a string of characters is the length (in bits) of the
smallest program which produces the string as output. This
definition is very abstract. In particular it is impossible,
even in principle, to find such a program. Nevertheless
there are algorithms explicitly conceived to approach this
theoretical limit. These are the file compressors, or
zippers. A zipper takes a file and tries to transform it into the
shortest possible file. Obviously this is not the best way
to encode the file, but it represents a good approximation
of it. One of the first compression algorithms is the
Lempel and Ziv algorithm (LZ77) (used for instance
by gzip, zip and Stacker). It is interesting to briefly
recall how it works. The LZ77 algorithm finds duplicated
strings in the input data. More precisely, it looks for the
longest match with the beginning of the lookahead buffer
and outputs a pointer to that match given by two
numbers: a distance, representing how far back the match
starts, and a length, representing the number of matching
characters. For example, the match of the sequence
$\sigma_1 \ldots \sigma_n$ will be represented by the pointer $(d, n)$, where
$d$ is the distance at which the match starts. The matching
sequence will then be encoded with a number of bits
equal to $\log_2(d) + \log_2(n)$, i.e. the number of bits
necessary to encode $d$ and $n$. Roughly speaking, the average
distance between two consecutive occurrences of
$\sigma_1 \ldots \sigma_n$ is of the order of the inverse of its
occurrence probability. Therefore
the zipper will encode more frequent sequences with few
bytes and will spend more bytes only for rare sequences.
The LZ77 zipper has the following remarkable property:
if it encodes a sequence of length $L$ emitted by an ergodic
source whose entropy per character is $s$, then the length
of the zipped file divided by the length of the original
file tends to $s$ when the length of the text tends to $\infty$.
In other words, it does not encode the file in the best way
but it does so better and better as the length of the file
increases.
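The property just stated can be explored numerically. The following is only a minimal sketch, not the implementation used for the results reported here: it assumes Python's standard zlib module as the zipper (its DEFLATE format combines LZ77 matching with Huffman coding), and the function names bernoulli_sequence, shannon_entropy_bits and zipped_bits_per_char are hypothetical, introduced for illustration.

```python
import math
import random
import zlib


def bernoulli_sequence(p: float, length: int, seed: int = 0) -> bytes:
    """Emit the character '0' with probability p and '1' with probability 1 - p."""
    rng = random.Random(seed)
    return bytes(ord("0") if rng.random() < p else ord("1") for _ in range(length))


def shannon_entropy_bits(p: float) -> float:
    """Entropy per character, in bits, of the two-symbol source above."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def zipped_bits_per_char(data: bytes) -> float:
    """Length of the zlib-compressed data in bits, divided by the input length."""
    return 8 * len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    # Compare the exact entropy with the estimate obtained by zipping.
    for p in (0.1, 0.3, 0.5):
        sequence = bernoulli_sequence(p, 200_000)
        print(p, shannon_entropy_bits(p), zipped_bits_per_char(sequence))
```

Since DEFLATE searches for matches only within a 32 KB sliding window, the ratio obtained this way should be expected to stay somewhat above the entropy rather than converge to it exactly.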

Compression algorithms thus provide a powerful
tool for measuring entropy, and the fields of application
are innumerable, ranging from the theory of Dynamical
Systems to Linguistics and Genetics. The
first conclusion one can draw is therefore that it is
possible to measure the entropy of a sequence simply by
zipping it. In this paper we exploit this kind of
algorithm to define a concept of remoteness between pairs
of sequences.

An easy way to understand where our definitions come
from is to recall the notion of relative entropy, whose
essence can be easily grasped with the following example.
Let us consider two ergodic sources A and B emitting
sequences of 0 and 1: A emits a 0 with probability $p$ and a 1
with probability $1 - p$, while B emits a 0 with probability $q$
and a 1 with probability $1 - q$. As already described, the
compression algorithm applied to a sequence emitted by
A will be able to encode the sequence almost optimally,
i.e. coding a 0 with $-\log_2 p$ bits and a 1 with $-\log_2(1 - p)$
bits. This coding will not be the optimal one for
the sequence emitted by B. In particular the entropy per
character of the sequence emitted by B in the coding
optimal for A will be $-q \log_2 p - (1 - q) \log_2(1 - p)$, while
the entropy per character of the sequence emitted by B
in its own optimal coding is $-q \log_2 q - (1 - q) \log_2(1 - q)$.
The number of bits per character wasted to encode the
sequence emitted by B with the coding optimal for A is
the relative entropy (see Kullback-Leibler [10]) of A and B,

$$S_{AB} = -q \log_2 \frac{p}{q} - (1 - q) \log_2 \frac{1 - p}{1 - q}.$$
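The bookkeeping in this example can be checked with a few lines of code. This is an illustrative sketch only; the function names are hypothetical, and the values of p and q in the usage lines are arbitrary choices, not taken from the text.

```python
import math


def cross_entropy_bits(p: float, q: float) -> float:
    """Bits per character spent on B's output when using the code optimal for A."""
    return -q * math.log2(p) - (1 - q) * math.log2(1 - p)


def entropy_bits(q: float) -> float:
    """Bits per character spent on B's output when using B's own optimal code."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def relative_entropy_bits(p: float, q: float) -> float:
    """Wasted bits per character: the relative entropy S_AB of the example."""
    return cross_entropy_bits(p, q) - entropy_bits(q)


if __name__ == "__main__":
    # Arbitrary illustration: A emits 0 with p = 0.5, B emits 0 with q = 0.25.
    print(cross_entropy_bits(0.5, 0.25), entropy_bits(0.25), relative_entropy_bits(0.5, 0.25))
```

With p = 0.5 and q = 0.25, the code optimal for A spends exactly 1 bit per character on B's output, while B's own optimal code needs about 0.811 bits, so roughly 0.189 bits per character are wasted. The difference vanishes when p = q.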

There exist several ways to measure the relative
entropy (see for instance [13]). One possibility is of
course to follow the recipe described in the previous
example: using the optimal coding for a given source to
encode the messages of another source. The path we
follow is along these lines. In order to define the relative
entropy between two sources A and B we extract a long
sequence $A$ from the source A and a long sequence $B$ as
well as a small sequence $b$ from the source B. We create a
new sequence $A + b$ by simply appending $b$ after $A$. The
sequence $A + b$ is now zipped, for example using gzip, and
the measure of the length of $b$ in the coding optimized
for $A$ will be $\Delta_{Ab} = L_{A+b} - L_A$, where $L_X$ indicates the
length in bits of the zipped file $X$. The relative entropy
$S_{AB}$ per character between A and B will be estimated by

$$S_{AB} = (\Delta_{Ab} - \Delta_{Bb}) / |b| \qquad (1)$$

where $|b|$ is the number of characters of the sequence $b$
and $\Delta_{Bb}/|b| = (L_{B+b} - L_B)/|b|$ is an estimate of the
entropy of the source B.
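The estimator of Eq. (1) translates almost literally into code. The sketch below assumes Python's zlib module in place of gzip (both use the DEFLATE format) and introduces the hypothetical names zipped_bits, delta_bits and relative_entropy_estimate; it is meant as an illustration under these assumptions, not as the code used for the experiments.

```python
import zlib


def zipped_bits(data: bytes) -> int:
    """L_X: length in bits of the zipped version of X (zlib is an assumed stand-in for gzip)."""
    return 8 * len(zlib.compress(data, 9))


def delta_bits(reference: bytes, probe: bytes) -> int:
    """Delta_{Xb} = L_{X+b} - L_X: cost of the probe b in the coding optimized for X."""
    return zipped_bits(reference + probe) - zipped_bits(reference)


def relative_entropy_estimate(seq_a: bytes, seq_b: bytes, probe_b: bytes) -> float:
    """Eq. (1): S_AB = (Delta_{Ab} - Delta_{Bb}) / |b|, in bits per character."""
    return (delta_bits(seq_a, probe_b) - delta_bits(seq_b, probe_b)) / len(probe_b)
```

Here seq_a and seq_b play the roles of the long sequences A and B, and probe_b that of the short sequence b extracted from source B. Because DEFLATE only looks back 32 KB, only the final part of a very long reference sequence can contribute matches.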

Translated into a linguistic framework, if A and B are
texts written in different languages, $\Delta_{Ab}$ is a measure of
the difficulty for a generic person of mother tongue A
to understand the text written in the language B. Let
us consider a concrete example where A and B are two
texts written, for instance, in English and Italian. We
take a long English text and we append to it an Italian
text. The zipper begins reading the file starting from the
English text. So after a while it is able to encode the
English file optimally. When the Italian part begins,
the zipper starts encoding it in a way which is optimal
for English, i.e. it finds most of the matches in
the English part. So the first part of the Italian file is
encoded with the English code. After a while the zipper
"learns" Italian, i.e. it progressively tends to find most
of the matches in the Italian part rather than in the
English one, and changes its rules. Therefore if the length
of the Italian file is "small enough" [15], i.e. if most of
the matches occur in the English part, the expression
(1) will give a measure of the relative entropy. We have
checked this method on sequences for which the relative
entropy is known, obtaining an excellent agreement
between the theoretical value of the relative entropy and
the computed value [15]. The results of our experiments
on linguistic corpora turned out to be very robust with
respect to large variations in the size of the file b
(typically 1-15 Kilobytes (Kb) for a typical size of file A of
the order of 32-64 Kb).

These considerations open the way to many possible
applications. Though our method is very general [16], in
this paper we focus in particular on sequences of
characters representing texts, and we shall discuss in particular
two problems of computational linguistics: context
recognition and the classification of corpora of sequences.

Language recognition: Suppose we are interested
in the automatic recognition of the language in which a
given text X is written. The procedure we use considers
a collection of long texts (a corpus) in as many different
(known) languages as possible: English, French, Italian,
Tagalog, ... We simply consider all the possible files
obtained by appending a portion $x$ of the unknown file X
to all the possible other files $A_i$ and we measure the
differences $L_{A_i+x} - L_{A_i}$. The file for which this difference
is minimal will select the language closest to that of
the X file, or exactly its language, if the collection of
languages contains this language. We have considered
in particular a corpus of texts in 10 official languages of
the European Union (EU) [17]: Danish, Dutch, English,
Finnish, French, German, Italian, Portuguese, Spanish
and Swedish. Each text of the corpus played in turn the
role of the text X and all the others the role of the $A_i$.
Using in particular 10 texts per language (giving a total
corpus of 100 texts) we found that for every
single text the method recognized the language: this
means that for any text X the text $A_i$ for which the
difference $L_{A_i+x} - L_{A_i}$ was minimum was a text written in
the same language. Moreover it turned out that, ranking
for each X all the texts $A_i$ as a function of the difference
$L_{A_i+x} - L_{A_i}$, all the texts written in the same language
were in the first positions. The recognition of the
language works quite well for lengths of the X file as small
as 20 characters.

Authorship attribution: Suppose in this case that we are
interested in the automatic recognition of the author of a
given text X. We shall consider as before a collection, as
large as possible, of texts of several (known) authors all
written in the same language as the unknown text, and we
shall look for the text $A_i$ for which the difference
$L_{A_i+x} - L_{A_i}$ is minimum. In order to collect sufficient statistics
we have performed the experiment using a corpus of 90
different texts [18], using for each run one of the texts
in the corpus as the unknown text. The results, shown in
Table I, feature a rate of success of 93.3%. This rate is the
ratio between the number of texts whose author has been
recognized (another text of the same author was ranked
as first) and the total number of texts considered.

AUTHOR         N. of texts   N. of successes 1   N. of successes 2
Alighieri            8               8                   8
D'Annunzio           4               4                   4
Deledda             15              15                  15
Fogazzaro            5               4                   5
Guicciardini         6               5                   6
Machiavelli         12              12                  12
Manzoni              4               3                   4
Pirandello          11              11                  11
Salgari             11              10                  10
Svevo                5               5                   5
Verga                9               7                   9
TOTALS              90              84                  89

TABLE I: Authorship attribution: For each author listed
we report the number of different texts considered and
two measures of success. The numbers of successes 1 and 2 are the
numbers of times another text of the same author was ranked
in the first position or in one of the first two positions,
respectively.
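The ranking procedure shared by language recognition and authorship attribution can be sketched as follows. This is a minimal illustration under stated assumptions: zlib stands in for the zipper, and the names zipped_len, rank_references and closest_reference, as well as the dictionary-based corpus layout, are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple


def zipped_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed stand-in for the zipper)."""
    return len(zlib.compress(data, 9))


def rank_references(x: bytes, corpus: Dict[str, bytes]) -> List[Tuple[str, int]]:
    """Rank the reference texts A_i by the difference L_{A_i+x} - L_{A_i}, smallest first."""
    scores = [
        (name, zipped_len(text + x) - zipped_len(text))
        for name, text in corpus.items()
    ]
    return sorted(scores, key=lambda item: item[1])


def closest_reference(x: bytes, corpus: Dict[str, bytes]) -> str:
    """Label of the reference text for which the difference is minimal."""
    return rank_references(x, corpus)[0][0]
```

A call such as closest_reference(x, {"English": english_text, "Italian": italian_text}) returns the label of the reference text that encodes x most cheaply; for authorship attribution the labels would instead identify texts of known authors. Taking the full ranking rather than only the first entry corresponds to the second success measure of Table I.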

The rate of success increases by considering more refined
procedures (performing for instance weighted averages
over the first m ranked texts of a given text). There
are of course fluctuations in the success rate for each
author, and this is to be expected since writing style is
something difficult to grasp and define; moreover it can
vary a lot within the production of a single author.

[Figure 1: tree diagram not reproduced. Leaf labels, in the
order in which they appear in the figure: Romani Balkan
[East Europe], Occitan-Auvergnat [France], Walloon [Belgium],
Corsican [France], Italian [Italy], Sammarinese [Italy],
Rhaeto Romance [Switzerland], Friulian [Italy], French [France],
Catalan [Spain], Occitan [France], Asturian [Spain],
Spanish [Spain], Galician [Spain], Portuguese [Portugal],
Sardinian [Italy], Romanian [Romania], Romani Vlach [Macedonia],
English [UK], Maltese [Malta], Welsh [UK], Irish Gaelic [Eire],
Scottish Gaelic [UK], Breton [France], Faroese [Denmark],
Icelandic [Iceland], Swedish [Sweden], Danish [Denmark],
Norwegian Bokmal [Norway], Norwegian Nynorsk [Norway],
Luxembourgish [Luxembourg], German [Germany],
Frisian [Netherlands], Afrikaans, Dutch [Netherlands],
Finnish [Finland], Estonian [Estonia], Turkish [Turkey],
Uzbek [Uzbekistan], Hungarian [Hungary], Basque [Spain],
Slovak [Slovakia], Czech [Czech Rep.], Bosnian [Bosnia],
Serbian [Serbia], Croatian [Croatia], Slovenian [Slovenia],
Polish [Poland], Sorbian [Germany], Lithuanian [Lithuania],
Latvian [Latvia], Albanian [Albania]. Group labels: ROMANCE,
CELTIC, GERMANIC, UGRO-FINNIC, ALTAIC, SLAVIC, BALTIC.]

FIG. 1: Language Tree: This figure illustrates the
phylogenetic-like tree constructed on the basis of more than
50 different versions of "The Universal Declaration of Human
Rights". The tree is obtained using the Fitch-Margoliash
method applied to a distance matrix whose elements are
computed in terms of the relative entropy between pairs of texts.
The tree features essentially all the main linguistic groups
of the Euro-Asiatic continent (Romance, Celtic, Germanic,
Ugro-Finnic, Slavic, Baltic, Altaic), as well as a few isolated
languages such as Maltese, typically considered an Afro-Asiatic
language, and Basque, classified as a non-Indo-European
language and whose origins and relationships with other
languages are uncertain. Notice that the tree is unrooted, i.e.
it does not require any hypothesis about common ancestors
for the languages. What is important is the relative
positions between pairs of languages. The branch lengths do not
correspond to the actual distances in the distance matrix.

Classification of sequences: Suppose we have a
collection of texts, for instance a corpus containing several
versions of the same text in different languages, and
suppose we are interested in a classification of this corpus.

One has to face two kinds of problems: the availability
of large collections of long texts in many different
languages and, related to it, the need for a uniform coding of
the characters in different languages. In order to solve the
second problem we have used for all the texts the
UNICODE standard coding. In order to have the largest
possible corpus of texts in different languages we have
used "The Universal Declaration of Human Rights" [20],
which holds the Guinness World Record for Most
Translated Document. Our method, borrowed from the
phylogenetic analysis of biological sequences [21],
considers the construction of a distance matrix, i.e. a matrix
whose elements are the distances between pairs of texts.
We define the distance by:

$$S_{AB} = (\Delta_{Ab} - \Delta_{Bb}) / \Delta_{Bb} + (\Delta_{Ba} - \Delta_{Aa}) / \Delta_{Aa} \qquad (2)$$

where A and B are indexes running over the corpus
elements and the normalization factors are chosen in
order to be independent of the coding of the original files.
Moreover, since the relative entropy is not a distance in
the mathematical sense, we make the matrix elements
satisfy the triangular inequality. It is important to
remark that a rigorous definition of distance between
two bodies of knowledge has been proposed by Li and
Vitanyi. Starting from the distance matrix one can
build a tree representation: phylogenetic trees [23],
spanning trees, etc. In our example we have used the
Fitch-Margoliash method of the package PHYLIP
(Phylogeny Inference Package), which basically constructs
a tree by minimizing the net disagreement between the
matrix pairwise distances and the distances measured on
the tree. Similar results have been obtained with the
Neighbor algorithm [25]. In Fig. 1 we show the results for
over 50 languages widespread on the Euro-Asiatic
continent. We can notice that essentially all the main
linguistic groups (Ethnologue source) are recognized:
Romance, Celtic, Germanic, Ugro-Finnic, Slavic, Baltic,
Altaic. On the other hand one has isolated languages such as
Maltese, which is typically considered an Afro-Asiatic
language, and Basque, which is classified as a
non-Indo-European language and whose origins and
relationships with other languages are uncertain.
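The construction of the distance matrix from Eq. (2) can be sketched as follows. This is an illustration under several assumptions that are not taken from the text: zlib stands in for the zipper, the short probes a and b are obtained by cutting the last probe_len bytes off each text (the hypothetical helper split_text), and all function names are invented for the example. The adjustment that enforces the triangular inequality is not reproduced, since its procedure is not specified here.

```python
import zlib
from typing import List, Sequence, Tuple


def zipped_bits(data: bytes) -> int:
    """L_X: length in bits of the zlib-compressed data."""
    return 8 * len(zlib.compress(data, 9))


def delta_bits(reference: bytes, probe: bytes) -> int:
    """Delta_{Xy} = L_{X+y} - L_X."""
    return zipped_bits(reference + probe) - zipped_bits(reference)


def split_text(text: bytes, probe_len: int) -> Tuple[bytes, bytes]:
    """Assumed convention: the tail of a text serves as its short probe sequence."""
    return text[:-probe_len], text[-probe_len:]


def distance(text_a: bytes, text_b: bytes, probe_len: int = 1000) -> float:
    """Symmetrized, normalized distance S_AB of Eq. (2)."""
    A, a = split_text(text_a, probe_len)
    B, b = split_text(text_b, probe_len)
    d_Ab, d_Bb = delta_bits(A, b), delta_bits(B, b)
    d_Ba, d_Aa = delta_bits(B, a), delta_bits(A, a)
    return (d_Ab - d_Bb) / d_Bb + (d_Ba - d_Aa) / d_Aa


def distance_matrix(texts: Sequence[bytes], probe_len: int = 1000) -> List[List[float]]:
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(texts[i], texts[j], probe_len)
    return matrix
```

Each text must be considerably longer than probe_len for the split to be meaningful. The resulting matrix is the kind of input expected by distance-based tree-building programs such as those of the PHYLIP package mentioned above, after conversion to that package's input format.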

Needless to say, a careful investigation of specific
linguistic features is beyond our purposes. In this
framework we are only interested in presenting the potential of
the method for several disciplines.

In conclusion, we have presented here a general method
to recognize and classify sequences of characters
automatically. We have discussed in particular the application
to textual corpora in several languages. We have shown
how a suitable definition of remoteness between texts,
based on the concept of relative entropy, allows one to
extract from a text several important pieces of information: the
language in which it is written, the subject treated as
well as its author; on the other hand the method
allows one to classify sets of sequences (a corpus) on the basis
of the relative distances among the elements of the
corpus itself and to organize them in a hierarchical structure
(graph, tree, etc.). The method is highly versatile and
general. It applies to any kind of corpus of character
strings independently of the type of coding behind them:
time sequences, language, genetic sequences (DNA,
proteins, etc.). It does not require any a priori knowledge
of the corpus under investigation (alphabet, grammar,
syntax) nor of its statistics. These features are
potentially very important for fields where human
intuition can fail: DNA and protein sequences, geological
time series, stock market data, medical monitoring, etc.